INFORMATION MANAGEMENT SYSTEM

This application is a continuation of application Ser. No.
08/144,767, filed Oct. 28, 1993, abandoned.

FIELD OF THE INVENTION 

This invention relates to an information storage, searching 
and retrieval system that incorporates a novel organization 
for presentation of search results from large (gigabytes) 
domains of archived textual data. 

BACKGROUND OF THE INVENTION 

On-line information retrieval systems are utilized for
searching and retrieving many kinds of information. Most
systems used today work in essentially the same manner:
users log on (through a computer terminal or personal
microcomputer, and typically from a remote location), select
a source of information (i.e., a particular database) which is
usually something less than the complete domain, formulate
a query, launch the search, and then review the search results
displayed on the terminal or microcomputer, typically with
documents (or summaries of documents) displayed in
reverse chronological order. This process must be repeated
each time another source (database) or group of sources is
selected (which is frequently necessary in order to ensure all
relevant documents have been found). Additionally, this
process places on the user the burden of organizing and
assimilating the multiple results generated from the launch
of the same query in each of the multiple sources (databases)
that the user needs (or wants) to search. Present systems that
allow searching of large domains require persons seeking
information in these domains to attempt to modify their
queries to reduce the search results to a size that the user can
assimilate by browsing through them (thus potentially
eliminating relevant results).

In many cases end users have been forced to use an
intermediary (i.e., a professional searcher) because the
current collections of sources are both complex and extensive,
and effective search strategies often vary significantly from
one source to another. Even with such guidance, potentially
relevant answers are missed because not all potentially
relevant databases or information sources are searched on every
query. Much effort has been expended on refining and
improving source selection by grouping sources or database
files together. Significant effort has also been expended on
query formulation through the use of knowledge bases and
natural language processing. However, as the groupings of
sources become larger, and the responses to more
comprehensive search queries become more complete, the person
seeking information is often faced with the daunting task of
sifting through large unorganized answer sets in an attempt
to find the most relevant documents or information.

SUMMARY OF THE INVENTION 

The invention provides an information storage, searching
and retrieval system for a large domain of archived data of
various types, in which the results of a search are organized
into discrete types of documents and groups of document
types so that users may identify relevant information
more efficiently and more conveniently than with systems
currently in use. The system of the invention includes means
for storing a large domain of data contained in multiple source
records, at least some of the source records being comprised
of individual documents of multiple document types; means
for searching substantially all of the domain with a single
search query to identify documents responsive to the query;
and means for categorizing documents responsive to the
query based on document type, including means for
generating a summary of the number of documents responsive to
the query which fall within various predetermined categories
of document types.

Preferably the means for categorizing documents and
generating the summary includes a plurality of predetermined
sets of categories of document types, and further
includes means for automatically customizing the summary
by automatically selecting one of the sets of categories,
based on the identity of the user or a characteristic of the user
(such as the user's professional position, technical
discipline, industry identity, etc.), for use in preparing the
summary. In this way, the summary for an individual user is
automatically customized to a format that is more easily and
efficiently utilized and assimilated. Alternately, the system
may be set up to allow the user to select
a desired set of categories for use in summarizing the search
results.
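
The selection just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions:
the names UserProfile, DEFAULT_SET_BY_DISCIPLINE, FALLBACK_SET
and select_category_set are hypothetical, as are the set names
and disciplines. An explicit user choice, when present, takes
precedence over the automatic choice.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: which category set is used by default for a
# given user characteristic (here, the user's technical discipline).
DEFAULT_SET_BY_DISCIPLINE = {
    "engineering": "set_1",
    "marketing": "set_2",
}

# Hypothetical default when the characteristic is not in the table.
FALLBACK_SET = "set_1"


@dataclass
class UserProfile:
    user_id: str
    discipline: str                      # a characteristic of the user
    preferred_set: Optional[str] = None  # explicit user selection, if any


def select_category_set(user: UserProfile) -> str:
    """Return the name of the category set used to summarize results."""
    if user.preferred_set is not None:
        # The user was allowed to pick a set, so honor that choice.
        return user.preferred_set
    # Otherwise customize automatically from the user's characteristic.
    return DEFAULT_SET_BY_DISCIPLINE.get(user.discipline, FALLBACK_SET)
```

In a real deployment the lookup table would presumably live with
the other per-user data rather than in code; it is inlined here
only to keep the sketch self-contained.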

The invention also relates to a method of storing,
searching and retrieving information for use with a large domain
of archived data of various types. The method involves
storing in electronically retrievable form a large domain of
data contained in documents obtained from multiple source
records, at least some of the source records containing
documents of multiple types; generating an electronically
executable search query; electronically searching at least a
substantial portion of such data based on the query to
identify documents responsive to the query; and organizing
documents responsive to the query and presenting a
summary of the number of documents responsive to the query by
type of document independently of the source record from
which such documents were obtained.

Preferably the method also involves defining one or more
sets of categories of document types, each category
corresponding to one or more document types, selecting one of
the sets of categories for use in presenting a summary of the
results of the search, and then sorting documents responsive
to the query by document type utilizing the selected set of
categories, facilitating the presentation of a summary of the
number of documents responsive to the query which fall
within each category in the selected set of categories.
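
As a rough sketch of this sorting and counting step, the Python
below maps each responsive document's type code to a category in
the selected set and tallies the documents per category, ignoring
the source record each document came from. The type codes, the
set names, the summarize function and the "Other" fallback are
assumptions made for illustration; the category labels are taken
from the trade magazine example given later in the description.

```python
from collections import Counter

# Hypothetical category sets. Each maps a document-type code to the
# category label under which that type is summarized.
CATEGORY_SETS = {
    "set_1": {
        "product_spec": "Product Specifications",
        "mfr_description": "Manufacturer Supplied Descriptions",
        "product_announcement": "Product Announcements",
        "trade_show": "Trade Show Information",
    },
    "set_2": {
        "product_spec": "Product Information",
        "mfr_description": "Product Information",
        "product_announcement": "Product Information",
        "trade_show": "Product Information",
    },
}


def summarize(results, set_name):
    """Count responsive documents per category of the selected set.

    `results` is an iterable of (doc_id, source_record, doc_type)
    tuples. The source record is deliberately unused: the summary is
    organized by document type, not by where a document came from.
    """
    mapping = CATEGORY_SETS[set_name]
    counts = Counter()
    for _doc_id, _source_record, doc_type in results:
        # "Other" is an assumed fallback for types the set omits.
        counts[mapping.get(doc_type, "Other")] += 1
    return counts
```

The same answer set passed through "set_1" and "set_2" yields the
same total number of documents, only grouped under different
labels.
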
The selection of the set of categories to be utilized may be
performed automatically based on predetermined criteria
relating to the identity of or a personal characteristic of the
user (such as the user's professional background, etc.), or the
user may be allowed to select the set of categories to be used.

The query generation process may contain a knowledge
base including a thesaurus that has predetermined and
embedded complex search queries, or use natural language
processing, or fuzzy logic, or tree structures, or hierarchical
relationships, or a set of commands that allow persons seeking
information to formulate their queries.
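
A thesaurus with embedded queries might be modeled roughly as
follows. This is a minimal sketch, not the system's actual data
structure: Concept, THESAURUS, lookup and build_query are
hypothetical names. The first embedded query string is the one
quoted in the "AIDS" example later in the description; the other
two query strings are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str   # concept shown to the user
    query: str  # embedded Boolean search strategy for that concept


# Hypothetical thesaurus keyed by the lower-cased word/phrase entered.
THESAURUS = {
    "aids": [
        Concept(
            "acquired immunodeficiency syndrome",
            '(acquired immunodeficiency syndrome!) or '
            '(acquired immune deficiency syndrome!) or '
            '("AIDS" not w/10 hearing! or beauty or retention! or '
            'visual! or computer! or diagnostic! or dispersing)',
        ),
        # Placeholder strategies, not taken from the source text.
        Concept("first aid product", "(first aid product!)"),
        Concept("navigational aid", "(navigational aid!)"),
    ],
}


def lookup(term):
    """Return the concepts the thesaurus lists for a word/phrase."""
    return list(THESAURUS.get(term.strip().lower(), []))


def build_query(saved_concepts):
    """OR together the embedded strategies of the concepts the user
    saved, yielding one Boolean query string."""
    return " or ".join(f"({c.query})" for c in saved_concepts)
```

The user only chooses among concept names; the Boolean syntax
stays hidden inside the knowledge base.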

The search process can utilize any index and search
engine techniques, including Boolean, vector, and
probabilistic, as long as a substantial portion of the entire domain of
archived textual data is searched for each query and all
documents found are returned to the organizing process.
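
For the Boolean case, retrieval is commonly built on an inverted
index that records where each word occurs, which is also the kind
of file structure the detailed description discusses. The Python
below is a toy sketch under simplifying assumptions (whitespace
tokenization, no punctuation handling, no compression); the
function names are hypothetical and it is not the commercial
software the description refers to.

```python
from collections import defaultdict


def build_index(docs):
    """Build word -> {doc_id: [positions]} from {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    # Freeze into plain dicts so lookups never create empty entries.
    return {word: dict(postings) for word, postings in index.items()}


def search_and(index, *terms):
    """Boolean AND: ids of documents containing every term."""
    postings = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_phrase(index, phrase):
    """Ids of documents containing the words of `phrase` adjacently,
    resolved from the stored word positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for doc_id in search_and(index, *words):
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id]
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits
```

Storing positions rather than bare document ids is what lets
phrase (and, by extension, proximity) requests be resolved
without rereading the documents.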

The sorting/categorization process prepares the search
results for presentation by assembling the various document
types retrieved by the search engine and then arranging these
basic document types into sometimes broader categories that
are readily understood by and relevant to the user.
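
One possible shape for this assembly step is sketched below. It
assumes each returned document already carries its category
label, drops duplicates by comparing title, authors and
publication date (the comparison the description suggests
elsewhere), and orders each category newest first, matching the
reverse chronological display mentioned in the description. The
Doc class and the assemble function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    title: str
    authors: tuple
    published: date
    category: str


def assemble(results):
    """Group de-duplicated results by category, newest first."""
    seen = set()
    grouped = {}
    for doc in results:
        # Two documents count as duplicates when title (ignoring
        # case), authors and publication date all match.
        key = (doc.title.casefold(), doc.authors, doc.published)
        if key in seen:
            continue
        seen.add(key)
        grouped.setdefault(doc.category, []).append(doc)
    for docs in grouped.values():
        docs.sort(key=lambda d: d.published, reverse=True)
    return grouped
```

The number of documents in each category is then simply the
length of each list in the returned mapping.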

The search results are then presented to the user and
arranged by category along with an indication as to the
number of relevant documents found in each category. The
user may then examine search results in multiple formats,
allowing the user to view as much of the document as the
user deems necessary.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an information
retrieval system of the invention;

FIG. 2 is a block diagram illustrating computer and
telecommunication hardware which may be utilized in the
invention;

FIG. 3 is a diagram illustrating a query formulation and
search process utilized in the invention;

FIG. 4 is a block diagram illustrating an inverted file
structure which may be utilized in the invention; and

FIG. 5 is a diagram illustrating a sorting process for
organizing and presenting search results.

BEST MODE FOR CARRYING OUT THE INVENTION

As is illustrated in the block diagram of FIG. 1, the
information retrieval system of the invention includes an
input/output process, a query generation process, a search
process that involves a large domain of textual data
(typically in the multiple gigabyte range), an organizing
process, presentation of the information to the user, and a
process to identify and characterize the types of documents
contained in the large domain of data.

Referring to FIGS. 1 and 2, a user utilizes an input/output
device to gain access to the system of the invention. Such
input/output device may be any type of computer terminal
capable of communicating with the searching hardware and
software. Although such a terminal might be linked directly
to the searching hardware and software, typically a standard
personal microcomputer or work station (including a
monitor [...]) with a modem would be used from a
remote location; alternately, the device may be simply a
computer terminal (such as a VT100) with a modem, operated
from a remote location. In each such situation, however,
queries are entered [...] device.

Through their input/output devices, remote users access
the system's access control computer 20 through an X.25
public data network or similar communication means. Users
may choose from a variety of standard telecommunication
systems to connect with the system's access control
computer, such as CompuServe, GTE Telenet, BT Tymnet,
Internet, etc. Alternately, the user could place a direct call to
the computing system.

The system's access control computer 20 (or computers, if
concurrent communication traffic requires multiple units)
accepts calls from users and validates their personal
identification numbers. This computer 20 preferably utilizes
non-stop processing architecture such as those available from
Stratus Computer Corp., Marlboro, Mass., or Tandem
Computers Inc., Cupertino, Calif. The number of computers 20
required for this task typically is determined by the number
of connections required to ensure that a caller in the busiest
period of the day will have a very low probability of
receiving a busy signal and being unable to connect to the
system. A user administration relational database 22
contains all the information utilized by the access control
computer 20 in controlling access to the system.

When an end user is accepted by the access control
computer 20 as a valid user, the user is then connected with
a Search Administration Server (SAS) 24. Typically at least
two SAS systems 24 are used to manage a domain of
information (unless a non-stop processing system is used) to
ensure maximum system availability. The number of SAS
systems 24 required again depends on the volume of use the
system handles and the target response time in the busiest
portion of a day; this can be determined using well-known
standard queuing models associated with multitasking
processes.

The SAS systems 24 conduct the appropriate dialogue
with the end user to elicit a query from the user identifying
what information the user is seeking. The SAS system 24
can operate in two very different modes.

One mode supports users that connect with a [...] such as
a Digital Equipment Corporation VT100 terminal (or
equivalent terminal). In this mode the SAS system 24
generates screens of display and monitors the keyboard
responses entered by the user to establish the information
sought and present the search results by category.

The second mode supports connections from remote
computing systems. In this mode the SAS system 24 accepts
and executes transactions from a predefined set that allows
for a query to be generated, search to be run, and search
results presented. In this mode the remote computing system
is in complete control of the end user's display screen and
is responsible for the look and feel of the end user activity.
This well-known mode of operation is commonly described
as a Client/Server Architecture.

Regardless of the mode of operation, at some point the
SAS system 24 is presented with a query representing a
request for information by the user. This query is composed
of terminology describing the various forms the information
might be stated in, typically along with Boolean connectors
to control the precision (i.e., the relevance) of documents
retrieved.

The SAS system 24 includes a display of the search server
complex [...] column [...] a search [...] 26, 26 . . . 26;
each of the search clones is, in effect, a replica of the
search engine in that column, redundancy being provided to
permit simultaneous searching (with predictable response
times) of the domain of data managed by a particular search
column. The SAS system 24 broadcasts the user's search to
the complex of search machines. It waits for a signal from a
machine in each column in the complex to ensure that the
entire domain will be searched. If after an appropriate time
a machine in one or more columns has not responded that it
has accepted the query and queued it for processing, the
SAS 24 will inform the user that the search will not be
completed across the entire domain and ask if the user
wishes to continue. This typically would occur only if
multiple search engines are not operational.

If all columns respond or the user indicates that the partial
search is acceptable, then the SAS 24 waits on each column
that accepted the query to begin to report its results to it. As
these results are received each document returned is
identified by document type and assigned to a particular category
in a predetermined set of categories. The system permits
different sets of categories to be available for use, but
preferably only one set of categories is associated with a
single user. As described below, the various sets of
categories allow a single document in a domain to be placed in
different categories depending on which set of categories is
being used; the selection of which set of categories is to be
used typically is based upon the identity of the user or a
predetermined characteristic of the user (such as the user's
professional training or technical discipline or any other
relevant criteria). This facilitates presentation of search
results utilizing terminology and groupings of document
types that are relevant and logical to the user, preferably
eliminating duplicate documents discovered in the search of
the domain. An advantage that this capability gives to the
system is that the user's time is saved finding relevant search
results, without compromising the thoroughness of the
search, thereby resulting in significant time savings to the
user in comparison to a search of similar thoroughness
utilizing existing database sources and retrieval systems.

When all results are reported (i.e., all columns have
indicated they are finished), the SAS 24 organizes the
documents into the above-described categories and in the
correct order for display, utilizing a predetermined key (such
as the date of the publication, the publisher, and/or
alphabetical priority of the document, etc.) that is generated for
each document when it is loaded into the database. Display
of the information to the user is usually in reverse
chronological order by date published but can be based on any
content of the document as desired. Once the sorting is
complete, search results are presented by category to the user.

The Search Engine Systems (SES) 26, 27, 28, etc. (i.e.,
the search engines plus the corresponding search clones)
house the documents that make up the domain of
information. These systems are a collection of loosely coupled
engines which may, if desired, have very different
architectures and search algorithms, as may be desired based on the
type of material (i.e., documents) they manage. Though
many of the SES engines may function differently, they must
all be able to communicate with the SAS 24; this can be
accomplished, e.g., by having them all support an Ethernet or
FDDI hardware interface and the TCP/IP communication
protocol.

It is possible that a single document collection may need
to be indexed by two or more SES units. For example,
particular material that has unique indexing requirements
may be indexed in a required (or desired) unique manner
without imposing the technique on the entire domain. This
makes the overall system much more cost effective than
other systems, and is totally transparent to the query
generation process and end user. Moreover, it facilitates
effective and efficient search strategies, producing a high level of
relevancy in retrieval across a widely varied domain of
information.

An additional SES type is shown on FIG. 2 as a series of
gateway engines 30. The gateway engines 30A, 30B . . . 30n
allow the query being processed to be re-routed to a source
that is external to the search server complex shown on FIG.
2. Such external sources may be housed in a completely
different computing system that is remote to the main
document collection and typically not part of the business
unit covering search results. The gateway servers 30
connect to such remote sources using various
telecommunication facilities (such as those used by end users to access the
information management system) through which they would
conduct an appropriate search and retrieve the results.
Again, such remote processes would be transparent to the
query generation process and to the end user, with the
possible exception that the response time to a query from
this type server would be dictated by the remote system and
could be substantially different from the normal SES system
response time.

As shown in FIG. 2, the SES systems are organized in
columns. The number of columns required is dictated by the
hardware processing system selected, the targeted maximum
search response time required, the size in gigabytes of the
domain, the number of alternate search techniques
incorporated in the domain, the presence of gateway servers 30, and
the number of simultaneous queries that must be processed
in the busiest period of searching.

As noted above, clone systems or "rows" may be created
within a single column based upon the expansion of
simultaneous demand. Each search clone 1 has all the same data
and all the same search capability as its corresponding
search engine 1 (they are, in effect, redundant); multiple
clones are provided so that more simultaneous requests can
be processed in that particular domain with predictable
response times. It is possible (but not typical) that different
columns would have a different number of rows if they were
supporting the same basic type of search activity.

The number of gateway clones 30 required is determined
by the level of effort required to re-route and manage search
queries being launched to information sources outside the
system, and thus would be determined independently from
the number of search engine clones.

Although the system of the invention is illustrated, and
generally described, as always searching substantially all of
the data stored in the system, it is possible to effectively
utilize the system of the invention on only selected columns
of the entire data domain in some circumstances. For
example, in some circumstances certain users may have
access to private collections of documents that are not
available to all users of the domain. These documents would
be kept in collections/columns isolated from the rest of the
domain. The SAS, upon recognizing that a user had rights to
a private column, would include it in the search. These rights
would be found in the user administration file.

Turning now to FIG. 3, the query generation process
preferably includes a knowledge base containing a thesaurus
and a note pad, and preferably utilizes embedded predefined
complex Boolean strategies. Such a system allows the user
to enter their description of the information needed using
simple words/phrases made up of "natural language" and to
rely on the system to assist in generating the full search
query, which would include, e.g., synonyms and alternate
phraseology. Systems of this type are known in the industry
including, e.g., Westlaw's "WIN" system (see, e.g.,
Pritchard-Schoch, Natural Language Comes of Age, Online,
pages 33-43, May 1993).

As illustrated in FIG. 3, a user enters a word/phrase
describing the technical topic about which knowledge is
sought. In the example illustrated in FIG. 3, the term
"AIDS" has been entered by the user. The thesaurus is
scanned and a list of technical concepts related to the
word/phrase entered is returned. In this case, the thesaurus
has returned concepts such as "acquired immunodeficiency
syndrome", "first aid product", "navigational aid", etc. The
user reviews the concepts found and saves relevant ones to
the note pad (thereby discarding irrelevant possible
connotations of the word/phrase entered). For each concept found
the user can have the thesaurus show a description of the
concept and other concepts that are related to it. The user
will be shown:

Broader: Concepts listed under this section are less specific
than the one selected.

Narrower: Concepts listed under this section are more
specific than the one selected.

Related: Concepts listed under this section are related to the
one selected. For example the concept "tire" is related to
the concept "automobile." In this example the relationship
is that one concept is a component of another. The
universe of possible relations is wide, and could include,
e.g.: component of, sibling of, direct product of, opposite
of, precursor of, version of, associated discipline, not
related to, contrast with, used in, class of, instance of,
form of, role of, caused by, counteragent of, produces/
produced by, property of, measured in, and measured by.

Process On: These are concepts that can act as a "process
on" the concept selected. For example the concept
"cutting" can be a "process on" metal.
Thus, entering the word "cutting" will return concepts,
under the "Process On" heading, such as metal, paper, and
wood.

Processed By: These are concepts that can be "processed by"
the concept selected. For example, the concept "metal"
can be "processed by" metal cutting. Thus, entering the
word "metal" will return concepts, under the "Processed
By" heading, such as cutting, forging, and drawing.

The note pad is continually updated as the user selects
additional relevant terms associated with the word/phrase
for later use in creating search strategies. Users may enter
additional words/phrases associated with the desired topic.
Users then create and execute search strategies using one or
more concepts saved on the note pad. The system translates
these concepts into complex Boolean search strategies and
automatically executes these strategies.

Referring again to the example shown in FIG. 3, after
entering the term "AIDS" the thesaurus presented a variety
of possible meanings for this term. If the user selects (by
entering the command "SE 1" or an equivalent command)
the first meaning presented, i.e., "acquired immune
deficiency syndrome", the system automatically executes an
embedded Boolean search strategy such as "(acquired
immunodeficiency syndrome!) or (acquired immune
deficiency syndrome!) or ("AIDS" not w/10 hearing! or beauty
or retention! or visual! or computer! or diagnostic! or
dispersing)." This complex search strategy includes
synonyms for the disease, and excludes concepts with the same
spelling but with different meanings such as hearing aids.
Thus, the user is not required to
anticipate all of the unintended meanings of relevant words
utilized in the search strategy, but has been able to launch a
relatively sophisticated and accurate search query just by
inputting a query in "natural language".

Upon completion of the searching process by the search
processing complex in response to the query, the results of
the search are presented to the user by category type (as
described below in greater detail). In the example of FIG. 3,
the search result identified 24 experts, 59 patents, 150
journal articles, etc. The user then can select the category to
view; again, in the example, the user has selected category 1
by issuing the command "VI 1", and a list of the experts
identified in the search is displayed in summary format. The
user can then request, by a command such as "VI CO 1", to
view the complete document selected from the list, giving,
in this case, complete information about the identity and
credentials of the expert.

Preferably the search process incorporates search engines
designed to utilize the Boolean method of retrieval for
textual data, accompanied by an inverted file structure that
is utilized to speed up retrieval. Boolean logic search
software is readily available for purchase from such companies
as InfoPro Technologies, McLean, Va.; Folio, Provo, Utah; and
Fulcrum, Ottawa, Canada. Complete descriptions of the
Boolean language and accompanying file structures are
available from these companies. Each supplier of Boolean
software also specifies the file structure of the domain. Most
software packages make use of an inverted file structure
because it dramatically speeds up retrieval, although such a
file structure is not strictly required.

A preferred fully inverted file architecture is illustrated
schematically in FIG. 4, and is commercially available
from Fulcrum, Ottawa, Canada. In such a system, a
dictionary 34 contains an entry for each searchable term (word) in
the document collection, with a pointer to further
information stored in reference file 36. The entries are ordered
alphabetically.

Data in the reference file 36 is stored in a compressed
format and contains detailed information on the exact
locations of words within documents 42. This information is
used to resolve phrase and proximity requests as well as
those for simple word combinations.

The index files (i.e., the dictionary and reference files) are
maintained by an indexing engine and are used by the search
engine to resolve queries. These files are updated when the
indexing engine is used to process the batch of documents
which have been modified or added since the last update
cycle.

A catalog 40 contains one entry for each document 42 in
a collection. It may be thought of as defining the collection:
all those documents 42 and only those documents with
entries in the catalog 40 are indexed and are subsequently
retrievable. Each catalog entry is identified by a unique
system-assigned identifier (called a catalog id or CID).

If a document's text is stored in an operating system file
outside of the catalog, the catalog entry contains physical
information such as the operating system file name, the
filters used to read the text and the file's last modified date.
In this manner, the catalog effects a mapping between
catalog id and the operating system filename.

In addition, the catalog entry for each document may store
information which pertains to that document but which is
not found in the external operating system file. This
information is stored as an arbitrary number of fields, each of
which is separately indexable and searchable. Each field
typically contains text. Numeric information, such as dates,
[...] permitting numeric range searching.

The catalog map 38 file provides a mapping from the
catalog id (CID) of each record to the location of
corresponding data in the catalog 40. The catalog map 38 may
also contain minimal status information concerning each
catalog entry.

The large (gigabytes) domain of archived textual data
searchable by the system of the invention consists typically
of technical, business and other information licensed from
database producers, information licensed from publishers,
and information created by the owner of the information
retrieval system (though, of course, the system may be
adapted for use with any type of information desired). The
information may be presented to the user in various formats,
including but not limited to abstracts, excerpts, full text or
compound documents (i.e., documents that contain both text
and graphics).

FIG. 5 illustrates how five typical sources of information
(i.e., source records) can be sorted into many document
types and then subsequently into categories. For example, a
typical trade magazine may contain several types of
information such as editorials, regular columns, feature articles
and news, product announcements, and a calendar of events.
Thus, the trade magazine (i.e., the source record) may be
sorted into these various document types, and these
document types in turn may be categorized or grouped into
categories contained in one or more sets of categories; each
document type typically will be sorted into one category
within a set of categories, but the individual categories
within each set will vary from one set to another. For
example, one set of categories may be established for a first
characteristic type of user, and a different set of categories
may be established for a second characteristic type of user.
When a user corresponding to type #1 executes a search, the
system automatically utilizes the categories of set #1,
corresponding to that particular type of user, in organizing the
results [...]. When a user
from type #2 executes a search, however, the system
automatically utilizes the categories of set #2 in presenting the
search results to the user.

Turning again, then, to the trade magazine example, when
the magazine is loaded into the system, a text analysis
process identifies each unique document type within the
magazine with a code and this code is utilized by the system,
in conjunction with the predetermined sets of categories, to
organize search results by document types into categories at
the end of each search. (An alternative to marking individual
document types with document type codes is to sort them
into categories at the time they are loaded into the system
and then search the individual categories; however, this may
require documents to be stored more than once in the domain
in order to customize categories for different types of users.)
If the user corresponds to category #1 (see FIG. 5), then the
number of documents responsive to the search query that fall
into the categories of "product specifications,"
"manufacturer supplied descriptions," "product announcements," and
"trade show information" are all summarized separately. On
the other hand, if the user corresponds to category #2, then
all of the documents responsive to the search query that fall
within these categories are lumped together in the category
"Product Information" in categories set #2. Thus, the same
query launched by two users corresponding to different
categories will yield the same answer set, but the answer set
will be summarized differently for the two individuals, each
being tailored to their particular needs. This customization

The collections of textual data (i.e., the source records)
are typically obtained either in electronic form, or are
obtained in hard copy form and then converted to electronic
form. In either case, the electronic form is loaded into the
appropriate search engine(s) of the system. During loading,
the process to identify and code information by document
type is typically accomplished by a combination of
automated and manual [...]. At the time of loading,
duplicate documents from multiple sources preferably are
identified and removed so that the results from a search
[...].
Duplicate documents may be identified by matching
information associated with a document such as key words in the
title, authors, and date of publication. Alternately, redundant
abstracts of a single title may be stored as unique text
segments of a single document.

As indicated above, the sorting process takes query search
results and sorts all relevant documents identified as meeting
the search criteria into the predetermined categories of
documents that are specific to the category set corresponding
to the user rather than specific to the sources/publishers of
the information (in contrast to existing information retrieval
systems such as Dialog, etc.). Sometimes these categories
may have a one-to-one relationship with the document types
(for example, patents may be both a document type and a
category) identified in the loading process (described above),
or these categories may be comprised of several document
types (for example, for some users product announcements,
product reviews, and product specifications may be grouped
into a category labeled "product information").

The results of the search and sorting processes are
presented to the user summarized by categories along with the
number of documents in each such category. Unless all
duplicates were removed at the time the source records were
input to the system, any duplicate documents retrieved may
be removed at this time by comparing titles, authors, and
publication date. The naming or labeling of categories is
of the summary of the search results facilitates review of the based on the identity of the user (or a personal characteristic 
search results, saying ! t time the user and reporting the of the user, as detailed above) rather than the organization of. 
re^ts v m 'a"manneT that is umqu^ 40 the domain being' searched and is ar^rnplisheli^wimbuf 
The sets of categories utilized by the system may be based duplicating the document in the domain. Category labels are 
upon any relevant criteria relating to the types of users who easily changed and expanded without reloading existing 
will utilize the system. For example, the sets of categories documents as new categories are encountered as the domain 
may be based upon the professional class of the user — i.e., grows over time. In contrast, typical on-line systems cur- 
legal, business, technical, etc. Within such broad classes 45 rently in use present search results as the number of hits in 
further distinctions could be made; for example, technical reverse chronological order sorted by the data supplier or 
users could be further identified by technical discipline (such source searched. In these prior art systems the output usually 
as chemical, electrical, mechanical, medical, etc.). is a function of the order of sources selected for searching. 
Alternately, users could be identified by industry, with or Far example, to conduct a search on the topic of neon 
without regard to professional class or technical discipline 30 lasers in a typical on-line system the user must first select a 
(such as lumber, medicine, glass manufacturing, etc). Other database and then enter a search strategy. In response to the 
possible methods far determining sets of categories could search query, the user will be presented with a display such 
include geographical location of the user, the company the as "233 neon lasers". This display means there are 233 
user works for, terminology most familiar to the user, or any documents retrieved responsive to the search query. The 
other relevant user characteristic. Also, in some cases cat- 55 documents may be of any document type contained in the 
egories with identical content could be given different database selected, and are all co-mingled. Issuing a corn- 
names, again depending on the terminology most familiar or mand to display the documents retrieved will result in a 
useful to the user. Alternately, if desired, the user may be reverse chronological sort (newest to oldest) of the com- 
permitted to select which of several sets of categories should ingled document types. Moreover, only documents con- 
be used by the system in reporting results, and, if desired, 60 tained in the selected database are identified. In contrast the 
which categories of document types will be utilized in a system of the invention not only searches substantially its 
particular selected category set (i.e., the user may be able to entire domain (not just a single database or a few selected 
customize not only which category set will be used, but will databases), but also summarizes the results by category of 
be able to customize which document types will be lumped document type. 

together in a particular category anaVor what name will be 65 The user is able to view multiple formats of the docu- 

given to such a customized category containing multiple ments by category. FIG. 3 refers to a sampling of formats 

document types). that are possible, such as "short", "KWIC" (key word in 
context), "abridged" and "complete." Other formats can be
utilized as desired. The formats allow the user to display all
or just certain portions of a document. Users typically will
scan portions of a document to ensure relevancy before
issuing the command for the complete document in order to
save time and money.

The information storage, searching and retrieval system
of the invention resolves the common difficulties in typical
on-line information retrieval systems that operate on large
(e.g., 2 gigabytes or more) domains of textual data, query
generation, source selection, and organizing search results.
The information base with the thesaurus and embedded
search strategies allows users to generate expert search
queries in their own "natural" language. Source (i.e.,
database) selection is not an issue because the search
engines are capable of searching substantially the entire
domain on every query. Moreover, the unique presentation
of search results by category set substantially reduces the
time and cost of performing repetitive searches in multiple
databases and therefore of efficiently retrieving relevant
search results.

While a preferred embodiment of the present invention
has been described it should be understood that various
changes, adaptations and modifications may be made therein
without departing from the spirit of the invention and the
scope of the appended claims.

What is claimed:

1. An information storage, searching and retrieval system
for large domain archived data of various types comprising:

means for storing a large domain of data contained in
multiple source records, at least some of the source
records being comprised of individual documents of
multiple document types;

means for searching at least a substantial portion of such
data based on a search query to identify documents of
multiple types responsive to the query; and

means for categorizing documents responsive to the query
based on document type, including means for generating
a summary of the number of documents responsive
to the query which fall within various predetermined
categories of document types.

2. The system of claim 1 wherein the means for categorizing
documents and generating the summary includes a
plurality of predetermined sets of categories of document
types.

3. The system of claim 2 wherein the means for generating
the summary includes means for customizing the summary
for the user by automatically selecting one of the sets of
categories for use in preparing the summary, such set of
categories being selected based on predetermined criteria
relating to a code identifying the user or a personal
characteristic of the user, so that the summary for an individual
user is automatically customized for the user based on such
code identifying the user or the personal characteristic of the
user.

4. The system of claim 2 wherein the means for generating
the summary includes means for customizing the summary
for the user by permitting the user to select one of the
predetermined sets of categories for use in customizing the
summary.

5. The system of claim 2 wherein the means for categorizing
documents and generating the summary includes
means for predetermining the sets of categories of document
types based on user inputs.

6. The system of claim 5 wherein the means for
predetermining the sets of categories includes means permitting
the user to customize the set of categories of document types
to be utilized.

7. The system of claim 2 wherein the means for categorizing
documents and generating the summary includes
means for predetermining the categories of document types
based on user inputs.

8. The system of claim 7 wherein the means for
predetermining the categories of document types includes means
permitting the user to customize the categories of document
types to be utilized.

9. The system of claim 1 wherein the means for categorizing
documents and generating the summary includes a
plurality of predetermined sets of categories of document
types, and further includes means for automatically
customizing the summary by automatically selecting one of the sets
of categories, based on a code identifying the professional
class of the user, for use in preparing the summary, so that
the summary for an individual user is automatically
customized to the user's professional class.

10. An information storage, searching and retrieval system
for large domain archived data of various types comprising:

means for storing a large domain of data contained in
multiple document types;

means for searching at least a portion of such data based
on a search query to identify documents of multiple
document types responsive to the query; and

means for categorizing documents responsive to the query
based on document type, including means for generating
a summary of the number of documents responsive
to the query which fall within various predetermined
categories of document types.

11. The system of claim 10 wherein the means for
categorizing documents and generating the summary
includes a plurality of predetermined sets of categories of
document types, each category in a set corresponding to one
or more document types.

12. The system of claim 11 wherein the means for
generating the summary includes means for customizing the
summary for the user by automatically selecting one of the
sets of categories for use in preparing the summary, such set
of categories being selected based on predetermined criteria
relating to a code identifying the user or a personal
characteristic of the user, so that the summary for an individual
user is automatically customized for the user based on such
code identifying the user or the personal characteristic of the
user.

13. The system of claim 10 wherein the means for
generating the summary includes a plurality of predetermined
sets of categories of document types, each category
corresponding to one or more document types, the means for
generating the summary further including means for
automatically customizing the summary by automatically selecting
one of the sets of categories, based on a code identifying
the professional class of the user, for use in preparing the
summary, so that the summary for an individual user is
automatically customized to the user's professional class.

14. An information storage, searching and retrieval system
for large domain archived data of various types comprising:

means for storing a large domain of data contained in
multiple source records, at least some of the source
records being comprised of individual documents of
multiple document types;

means for searching at least a substantial portion of such
data based on a search query to identify documents of
different document types responsive to the query; and

means for categorizing documents responsive to the query
based on document type and independently of the
source record from which such documents were
obtained, including means for generating a summary of 
the number of documents responsive to the query 
which fall within each of the document types. 

15. The system of claim 14 wherein the means for 
generating the summary includes one or more predetermined 
sets of categories of document types, each category corre- 
sponding to one or more document types, and further 
includes means for summarizing the number of documents 
responsive to the query which fall within the various pre- 
determined categories of a selected one of such sets of 
categories. 

16. The system of claim 15 wherein the means for 
generating the summary includes means for customizing the 
summary for the user by automatically selecting one of the 
sets of categories for use in preparing the summary, such set 
of categories being selected based on predetermined criteria 
relating to a code identifying the user or a personal charac- 
teristic of the user, so that the summary for an individual 
user is automatically customized for the user based on such 
code identifying the user or the personal characteristic of the 
user. 

17. A method of storing, searching and retrieving infor- 
mation for use with a large domain of archived data of 
various types comprising: 

storing in electronically retrievable form a large domain 
of data contained in documents obtained from multiple 
source records, at least some of the source records 
containing documents of multiple types; 

generating an electronically executable search query; 

electronically searching at least a portion of such data 
based on the query to identify documents of multiple 
document types responsive to the query; and 

sorting documents responsive to the query and presenting 
a summary of the number of documents responsive to 
the query by type of document independently of the 
source record from which such documents were 
obtained. 

18. A method of storing, searching and retrieving information
for use with a large domain of archived data of
various types comprising: 
storing in electronically retrievable form a large domain 
of data contained in documents obtained from multiple 
source records, at least some of the source records 
containing documents of multiple types;

defining one or more sets of categories of document types,
each category corresponding to one or more document 
types; 

generating an electronically executable search query; 
electronically searching at least a portion of such data 
based on the query to identify documents of multiple 
document types responsive to the query; 
selecting one of the sets of categories for use in presenting
a summary of the results of the search; and
sorting documents responsive to the query by document 
type and, utilizing the selected set of categories, pre- 
senting a summary of the number of documents respon- 
sive to the query which fall within each category in the 
selected set of categories. 

19. The method of claim 18 wherein the step of selecting 
one of the sets of categories is performed automatically 
based on predetermined criteria relating to a code identify- 
ing the user or a personal characteristic of the user. 

20. The method of claim 16 wherein the step of selecting 
one of the sets of categories is performed automatically 
based on a code identifying the professional class of the user, 
so that the summary for an individual user is automatically 
customized to the user's professional class. 

21. The method of claim 18 wherein substantially all of
the data is searched based on the query.

22. An information storage, searching and retrieval sys- 
tem for large domain archived data of various types com- 
prising: 

means for storing a large domain of data contained in 
multiple source records, at least some of the source 
records being comprised of individual documents of 
multiple document types; 
means for searching at least a substantial portion of such 
data based on a search query to identify documents of 
multiple types responsive to the query; and 
means for categorizing documents responsive to the query 
based on document type, including a plurality of pre- 
determined sets of categories of document types, at 
least one of the categories in at least one of the sets 
corresponding to more than one document type, and
means for generating a summary of the number of 
documents responsive to the query which fall within 
the various categories of one of such predetermined 
sets of categories. 
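
By way of illustration only, the sorting and summarizing steps described above (removing duplicates by comparing titles, authors, and publication date; selecting a set of categories based on a code identifying the professional class of the user; and counting responsive documents in each category of the selected set) might be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the implementation of the described system: the names Document, CATEGORY_SETS, USER_CLASS_TO_SET, remove_duplicates and summarize, the document type codes, the "other" fallback category, and the sample data are all hypothetical and do not appear in the specification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    authors: tuple
    pub_date: str
    doc_type: str  # hypothetical document type code assigned at load time
    source: str    # source record; deliberately ignored when categorizing


# Hypothetical sets of categories. Each category label maps to one or
# more document type codes.
CATEGORY_SETS = {
    "set1": {
        "product specifications": {"product_spec"},
        "product announcements": {"product_announcement"},
        "patents": {"patent"},
    },
    "set2": {
        "product information": {"product_spec", "product_announcement"},
        "patents": {"patent"},
    },
}

# Hypothetical mapping from a professional-class code to a category set.
USER_CLASS_TO_SET = {"technical": "set1", "business": "set2"}


def remove_duplicates(docs):
    """Keep the first document for each (title, authors, date) key."""
    seen = set()
    unique = []
    for doc in docs:
        key = (
            doc.title.strip().lower(),
            tuple(a.strip().lower() for a in doc.authors),
            doc.pub_date,
        )
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def summarize(docs, user_class):
    """Count responsive documents per category of the user's category set."""
    categories = CATEGORY_SETS[USER_CLASS_TO_SET[user_class]]
    type_to_category = {
        doc_type: label
        for label, doc_types in categories.items()
        for doc_type in doc_types
    }
    counts = Counter()
    for doc in remove_duplicates(docs):
        # "other" is an assumed fallback for types outside the selected set.
        counts[type_to_category.get(doc.doc_type, "other")] += 1
    return dict(counts)


# Hypothetical search results; the second entry duplicates the first.
hits = [
    Document("Neon Laser X100", ("A. Smith",), "1993-05-01",
             "product_spec", "Trade Magazine A"),
    Document("neon laser x100", ("a. smith",), "1993-05-01",
             "product_spec", "Database B"),
    Document("X100 Announced", ("B. Jones",), "1993-04-01",
             "product_announcement", "Trade Magazine A"),
    Document("Laser Apparatus", ("C. Lee",), "1992-01-01",
             "patent", "Patent Database"),
]

print(summarize(hits, "technical"))
print(summarize(hits, "business"))
```

In this sketch the same answer set is summarized differently for the two user classes, and the source field plays no part in the categorization, which mirrors the document-type-based (rather than source-based) organization described above.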